CoCalc -- SDSS Classification.ipynb

GitHub Repository: aaskay/Sloan-Digital-Sky-Survey
Path: blob/master/SDSS Classification.ipynb
⁵⁸ views

Kernel: Python 3

I was searching for an astronomy dataset on Kaggle and came across this project (https://www.kaggle.com/lucidlenn/data-analysis-and-classification-using-xgboost). I'd like to give credit to Lennart Grosser for a couple of the plots that I use in this notebook.

Getting the data

I used the CasJobs website which offers a SQL-based interface to query the databases that contain the SDSS data. The CasJobs system allows you to save the data returned from the query in csv format.

For more information about how to get data from the SDSS see their Data Access Guide:

http://www.sdss.org/dr14/data_access/

I used the following query to retreive the data:

SELECT TOP 100000
p.objid,
p.ra,
p.dec,
p.cModelMag_u,
p.cModelMag_g,
p.cModelMag_r,
p.cModelMag_i,
p.cModelMag_z,
p.modelMag_u,
p.modelMag_g,
p.modelMag_r,
p.modelMag_i,
p.modelMag_z,
p.psfMag_u,
p.psfMag_g,
p.psfMag_r,
p.psfMag_i,
p.psfMag_z,
p.petroMag_u,
p.petroMag_g,
p.petroMag_r,
p.petroMag_i,
p.petroMag_z,
s.specobjid,
s.class,
s.z as redshift
FROM PhotoObj AS p
JOIN SpecObj AS s ON s.bestobjid = p.objid
WHERE
p.cModelMagErr_u BETWEEN 0 AND 0.5
AND g BETWEEN 0 AND 20

In [72]:

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

from sklearn.metrics import accuracy_score
from sklearn.feature_selection import RFE
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder

#import xgboost as xgb
import seaborn as sns
from xgboost.sklearn import XGBClassifier
from sklearn import metrics   #Additional scklearn functions

sns.set_style('whitegrid')

Import the data

Define the columns that we'll use and import the data from the csv file. The cModelMag filter columns were chosen based on information at http://www.sdss3.org/dr8/algorithms/magnitudes.php#cmodel. Specifically, under the section titled 'Which Magnitude Should I Use?', we have the following: "...the cmodel magnitude is now an adequate proxy to use as a universal magnitude for all types of objects".

In [2]:

myFeatures = ['ra','dec','cModelMag_u','cModelMag_g','cModelMag_r','cModelMag_i','cModelMag_z','redshift']
myTargets = ['class']
sdss_data = pd.read_csv('SDSS_Small.csv',usecols=myFeatures+myTargets)

In [3]:

sdss_data.describe()

Out[3]:

In [4]:

sdss_data.shape

Out[4]:

(100000, 9)

In [5]:

sdss_data.head()

Out[5]:

Get a count of each type of class we are working to predict.

In [6]:

sdss_data['class'].value_counts()

Out[6]:

GALAXY    52401
STAR      30367
QSO       17232
Name: class, dtype: int64

We see that our data is not balanced between the classes.

In [7]:

sdss_features = sdss_data[myFeatures]
sdss_targets = sdss_data[myTargets]

In [8]:

sdss_features.head()

Out[8]:

In [209]:

sns.pairplot(x_vars='ra', y_vars='dec', data=sdss_data, hue='class',palette="Set1", size=5,
             aspect=2,markers=["o", "s", "D"],plot_kws={'alpha': 0.25})
plt.title('Equatorial coordinates')

Out[209]:

<matplotlib.text.Text at 0x198b02133c8>

Looking at the above plot of right ascention and declination for each of our class types, we see that there is no clear seperation in the positions of our objects. We can conclude that these features will not add any real predictive value to our model and therefore we can feel confident in dropping them from our dataset.

In [9]:

sdss_features = sdss_features.drop(['ra','dec'],axis=1)

In [10]:

sdss_features.head()

Out[10]:

Now are dataset is only in terms of frequency filters and redshift.

Redshift

Looking at the redshift for each class of object, we see that the further away an object is the more redshifted it appears. From these graphs, it looks like the redshift will be an important feature in classifying observations.

In [170]:

fig, axes = plt.subplots(nrows=1, ncols=3,figsize=(16, 4))
ax0, ax1, ax2 = axes.flatten()

n_bins = 20
x = sdss_data[sdss_data['class']=='STAR'].redshift
y = sdss_data[sdss_data['class']=='GALAXY'].redshift
z = sdss_data[sdss_data['class']=='QSO'].redshift

ax0.hist(x, bins = n_bins)
ax0.set_title('Star')

ax1.hist(y, bins = n_bins)
ax1.set_title('Galaxy')

ax2.hist(z, bins = n_bins)
ax2.set_title('QSO')

fig.tight_layout()
plt.show()

Out[170]:

The plots above confirm our intuition that the further an object is from the observer, the more redshifted it will be. Redshift will be a good predictor value but unfortunately, there is a bit of overlap between the reshift of some galaxies and some QSO's.

Feature Corrolation Plot:

In [228]:

sns.set(style="white")

# Compute the correlation matrix
corr = sdss_features.corr()

# Set up the matplotlib figure
fig, axes = plt.subplots(figsize=(8, 8))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.yticks(rotation=0)
plt.xticks(rotation=45)

plt.show()

Out[228]:

We can see some correlation between the cModelMagnitudes for the u,g, and r filters as well as between the i and z filters. There also seems to be some correlation between the cModelMagnitudes for the u,g, and r filters and redshift.

In [13]:

sdss_standardized_features = preprocessing.StandardScaler().fit(sdss_features).transform(sdss_features)
features = np.array(sdss_standardized_features)
targets = np.array(sdss_targets)

Split the data into test and training subsets

In [14]:

training_data, test_data, training_targets, test_targets = train_test_split(features, targets, test_size=0.25, random_state=42)

Encode the targets. XGBoost cannot use string classes directly

In [15]:

# encode string class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(training_targets.ravel())
training_targets = label_encoder.transform(training_targets.ravel())
label_encoder = label_encoder.fit(test_targets.ravel())
test_targets = label_encoder.transform(test_targets.ravel())

Define a basic XGBoost classifier

In [18]:

# max_delta_step set to one since we have an embalance in classes.  Can help with convergence.
xgb = XGBClassifier(max_delta_step=1,random_state=42)

Fitting an XGBoost model with default parameters

In [19]:

xgb.fit(training_data, training_targets)

Out[19]:

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=1,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=42,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

Above we see the default parameters used by the classifier.

Make prediction and look at the accuracy

In [20]:

predictions = xgb.predict(test_data)
xgb_accuracy = (predictions == test_targets).sum().astype(float) / len(predictions)*100
print("XGBoost's prediction accuracy is: %3.2f" % (xgb_accuracy))

Out[20]:

XGBoost's prediction accuracy is: 97.64

Use sklearn's RandomizedSearchCV to try and tune some parameters

In [98]:

n_estimators = range(800,1600,100)
max_tree_depth = [8] #range(3,9)
eta = [0.1] #[0.01,0.05,0.1,0.5,1] #learning rate
lmbda = [1,2,3,4] #[0.5,1,2]  #L2 regularization term on weights
min_child_wt = [1,2,3]
gamma = [0,0.1,0.5] #[0,1,2,4]  #Minimum loss reduction required to make a further partition on a leaf node of the tree.

In [99]:

xgbParams = {
    'learning_rate':eta,
    'gamma':gamma,
    'n_estimators':n_estimators,
    'reg_lambda':lmbda,
    'max_depth':max_tree_depth,
    'min_child_weight':min_child_wt
}

In [100]:

clf = RandomizedSearchCV(xgb,xgbParams)

In [101]:

clf.fit(training_data, training_targets)

Out[101]:

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=1,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=42,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
          fit_params=None, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'learning_rate': [0.1], 'gamma': [0, 0.1, 0.5], 'n_estimators': range(800, 1600, 100), 'reg_lambda': [1, 2, 3, 4], 'max_depth': [8], 'min_child_weight': [1, 2, 3]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring=None, verbose=0)

In [102]:

clf_best_parameters = clf.best_estimator_
print(clf_best_parameters)

Out[102]:

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0.5, learning_rate=0.1, max_delta_step=1,
       max_depth=8, min_child_weight=1, missing=None, n_estimators=900,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=42,
       reg_alpha=0, reg_lambda=2, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [103]:

preds = clf.predict(test_data)

In [104]:

xgb_accuracy = (preds == test_targets).sum().astype(float) / len(preds)*100
print("XGBoost's prediction accuracy is: %3.2f" % (xgb_accuracy))

Out[104]:

XGBoost's prediction accuracy is: 98.19

We achieve a slight improvement by searching over a range of parameter values to try to hypertune the learning parameters of the model.

The Confusion Matrix

In [42]:

from sklearn.metrics import confusion_matrix
import itertools

In [105]:

cm = confusion_matrix(test_targets, preds)
classes = label_encoder.inverse_transform(test_targets)
classes = np.unique(classes)

In [106]:

plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title('Confusion matrix')
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)

#fmt = '.2f'
fmt = 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, format(cm[i, j], fmt),
             horizontalalignment="center",
             color="white" if cm[i, j] > thresh else "black")

plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

Out[106]:

The results are in line with what we saw in the redshift plots for the different classes of objects. Galaxies and QSO's are much harder to differentiate.

We have used the XGBoost classifier to predict with 98 percent accuracy the class (type) of stellar object for a given SDSS observation. Further hypertuning of parameters might acheive a slight increase in accuracy but further research into the features available in the data would probably be the best way to increase accuracy of the model in its ability to differentiate galaxies from QSO's.

In [ ]: